Statistical unicodification of African languages

Author

  • Kevin P. Scannell
Abstract

Many languages in Africa are written using Latin-based scripts, but often with extra diacritics (e.g. dots below in Igbo: ị, ọ, ụ) or modifications to the letters themselves (e.g. open vowels “e” and “o” in Lingala: ɛ, ɔ). While it is possible to render these characters accurately in Unicode, keyboard input methods are often not easily accessible or are cumbersome to use, and so the vast majority of electronic texts in many African languages are written in plain ASCII. We call the process of converting an ASCII text to its proper Unicode form unicodification. This paper describes an open-source package which performs automatic unicodification, implementing a variant of an algorithm described in previous work by De Pauw, Wagacha, and de Schryver. We have trained models for more than 100 languages using web data, and have evaluated each language using a range of feature sets.
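To make the task concrete, here is a minimal sketch of unicodification as a word-level lookup. It is not the algorithm implemented in the package, and not the one by De Pauw, Wagacha, and de Schryver; it only illustrates the input/output relationship. The function names (`asciify`, `train`, `unicodify`), the fold table, and the toy training string are assumptions made for illustration. Words never seen in training are left unchanged here, whereas a real system would need some fallback for them.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Hypothetical fold table: letters that are not "base + combining mark"
# pairs need an explicit ASCII look-alike (here only the open vowels).
EXTRA_FOLDS = {"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"}

# A word is a run of word characters, optionally carrying combining marks.
WORD = re.compile(r"[\w\u0300-\u036f]+")


def asciify(text):
    """Strip combining diacritics and fold modified letters to ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in stripped)


def train(corpus):
    """Map each lowercased ASCII form to its most frequent Unicode form."""
    counts = defaultdict(Counter)
    for word in WORD.findall(unicodedata.normalize("NFC", corpus)):
        lowered = word.lower()
        counts[asciify(lowered)][lowered] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def unicodify(text, model):
    """Replace each known ASCII word with its learned Unicode form."""
    def restore(match):
        word = match.group(0)
        form = model.get(word.lower())
        if form is None:
            return word  # unseen word: left as-is in this sketch
        if word[0].isupper():
            form = form[0].upper() + form[1:]  # keep an initial capital only
        return form
    return WORD.sub(restore, text)


# Toy training data (illustrative only, not from the paper).
model = train("ụlọ ụlọ ulo mɔkɔ")
print(unicodify("Ulo moko", model))
```

A pure word lookup cannot choose between two Unicode forms that share one ASCII spelling except by frequency, and it does nothing for unseen words; handling those cases requires looking at context or at sub-word units rather than whole words.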

Related resources

A Literary Anthroponomastics of Three Selected African Novels: A Cross Cultural Perspective

Names as markers of identity are a source of a wide variety of information. This paper explores the names of characters to show the sociocultural factors which influence the choice of names and the effects that the names of these characters have on the roles they play. Using a variety of personal names from Ayi Kwei Armah’s Fragments, Buchi Emecheta’s The Joys of Motherhood, a...

Lexical Semantics and Selection of TAM in Bantu Languages: A Case of Semantic Classification of Kiswahili Verbs

The existing literature on Bantu verbal semantics demonstrated that inherent semantic content of verbs pairs directly with the selection of tense, aspect and modality formatives in Bantu languages like Chasu, Lucazi, Lusamia, and Shiyeyi. Thus, the gist of this paper is the articulation of semantic classification of verbs in Kiswahili based on the selection of TAM types. This is because the sem...

Identity and Representation through Language in Ghana: The Postcolonial Self and the Other

Research related to colonialism and post colonialism shows how the identities of indigenous people were constructed and how these identities are reconstructed in our contemporary world. The thrust of this paper is that colonialism brought a shift in the linguistic structure of Ghana with the introduction of the use of English among Ghanaians. The coexistence of both Ghanaian languages and Engli...

Exploring unsupervised word segmentation for machine translation in the South African context

We explore the application of unsupervised word segmentation algorithms to phrase-based statistical machine translation (SMT) systems, translating from English to four South African languages: Afrikaans, Northern Sotho, Tsonga and Zulu. Positive results in terms of the standard BLEU and NIST scores are obtained for systems translating into Afrikaans and Zulu.

Journal:
  • Language Resources and Evaluation

Volume 45

Publication date: 2011